The Logic Backbone of a Transcription Network

A great part of the effort in the study of coarse-grained models of
transcription networks concentrates on their dynamical features. In
this letter, we consider their equilibrium properties, showing that
the backbone underlying the dynamic descriptions is an optimization
problem. It involves N variables, the gene expression levels, and M
constraints, the effects of transcriptional regulation. In the case
of Boolean variables and constraints, we investigate the structure of
the solutions and derive phase diagrams. Notably, the model exhibits
a connectivity transition between a regime of simple gene control,
where the input genes control O(1) other genes, and a regime of
complex control, where some "core" input genes control O(N) others.

PACS numbers: 87.10.+e, 89.75.Fb, 89.75.Hc



Introduction. Identity, response and architecture of a liv- 
ing system are central topics of molecular biology. Presently, 
they are largely seen as a result of the interplay between a 
gene repertoire and the regulatory machinery. Gene
transcription in mRNA form is an important step in this pro- 
cess. At this level, the regulatory machinery is embodied 
by the transcription factors, proteins that bind to special sites 
along DNA and control the activity of RNA polymerase 
(Fig. 1). This process is referred to as signal integration.
Together, the cis-regulatory regions establish a set of
interdependencies between transcription factors and genes,
including other transcription factors: a "transcription network".
Some such networks of living organisms are now being explored
experimentally, and show a modularity that has important
biological implications. Understanding the gene
expression patterns determined by these networks is an enor- 
mous challenge. The problem is that transcription networks 
are fairly large, so that a coarse graining is needed. This 
fact has many consequences, mainly related to the dynamics. 
For example, the pioneering approach of Kauffman [8], suggesting
a synchronous deterministic update for a Boolean (i.e. on/off)
representation of the network, is still being debated [9].
Microscopically, it is well accepted that the Gillespie
algorithm [6, 7] correctly describes the events of chemical
kinetics involved. On the other hand, with a mesoscopic average in
time, it is not clear what the emergent time scales might be. 

FIG. 1: Schematics of the representation of a transcription network.
Each signal integration function at the cis-regulatory region of a gene
corresponds to a constraint on the gene expression variables. Bottom:
example of factor graph of GR1 for $K_b = K = 2$. Each diamond
node represents a constraint, while each black circle is a variable.

We approach this problem with a model based on two features.
Firstly, rather than on dynamics, it focuses on the compatibility
between gene expression patterns and signal integration
functions. Simply put, a cell with N genes can express them in
exponentially many ways, $2^N$ in the Boolean representation.
However, the cell never explores all the patterns of expression.
It only knows clusters of correlated configurations. An
elementary example is the cI-cro switch of $\lambda$-phage [3].
Looking at the system, one could observe the states 10, 01 or
perhaps 00, but not 11. One can think that in larger systems many
configurations are ruled out for the same compatibility reasons.
Secondly, the model takes explicitly into account that some genes
are essentially "free" from



the point of view of transcription (Fig. 2). This fact is
evident looking at the available data [5]. While the biological
situation is more complex, we regard these genes as input
receptors, connected to external stimuli. The simplest
formulation (GR1, from Gene-Regulation) assumes Boolean variables
and functions. We use it to investigate theoretically the
control exerted by the free genes on the expression patterns for
large N, and for random realizations of the constraints.
Analysis of the satisfying configurations leads to the intro- 
duction of a "core" of network variables. Depending on the 
number of free genes in the core and the connectivity of the 
constraints, there are three distinct regimes of gene control. 
In the first regime, the core is empty. Each free gene con- 
trols the state of a small number of genes ("simple control" 
phase). In the second regime ("complex control"), the core 
contains free genes that control, both directly and indirectly, 
order N others. Thus, in the complex control phase, the free 
core genes can be interpreted as the subset of genes that deter- 
mine a choice of an expression program. In the third regime, 
there are no free genes in the core, and the system cannot con- 
trol the simultaneous expression of all its genes. The transi- 
tion can be tuned by varying the connectivity and the number 
of constraints.

Model. The two main ingredients of our representation of
a transcription network are: (i) A set of N discrete variables
$\{x_i\}_{i=1..N}$ associated to genes or operons (identified with
their transcripts and protein products). These variables
represent the expression levels. (ii) A set of M interactions,
representing the signal integration,
$\{I_b(x_{b_0}, x_{b_1}, .., x_{b_{K_b}})\}_{b=1..M}$,
with $\gamma = M/N < 1$. It is useful to represent variables
and interactions in a so-called factor graph, as illustrated in
Fig. 1. Note that the constraints contain the topology of the
graph. The $x_i$ represent real or coarse-grained expression
levels and can take values in $\{0, .., q\}$ or even continuous ones.
In general, using the Shea and Ackers model of gene activation
by recruitment [4, 13], one can construct a local free energy
associated to each signal integration node, that generates
the constraints [14]. Here we consider the simplest possible
scenario, GR1, treating the expression levels as Boolean
variables (i.e. setting q = 1), and the signal integration functions
as Boolean functions
$\{f_b(x_{i(b,1)}, .., x_{i(b,K_b)})\}_{b=1..M}$. The
coordinates $i(b,l)$ point at the variable occupying place $l$ in the
$b$-th constraint. The expression

$$x_{i(b,0)} = f_b(x_{i(b,1)}, .., x_{i(b,K_b)}), \qquad (1)$$

imposes that the variable $x_{i(b,0)}$ is the output of the function
$f_b$. For example, let us consider a graph with three variables
and one constraint, labeled by $b = 1$. Supposing the first two
variables regulate the third, $K_1 = 2$, $i(1,0) = 3$; $i(1,1) = 1$;
$i(1,2) = 2$. If, for instance, the transcription can occur
only when both regulators are present, then $f_1(x_1, x_2) = 1$
only if $x_1 = x_2 = 1$ (Boolean AND function). The local
connectivity of a function node is $k_b = 1 + K_b$. $K_b$ is called the
"in-degree". The factor graph is also associated to an
"out-degree" $C_i = c_i - 1$, where $c_i$ is the number of functions
connected to $x_i$. The fact that variables and constraints are
Boolean makes GR1 an optimization problem of the satisfiability
type (Sat) [10], albeit a very special one, given the structure of
the signal integration functions. The properties of the model
depend on the class of graphs and Boolean functions used.
The results presented in this work hold for a rather large class
of Boolean functions (see appendix A).
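
As a minimal sketch of the three-gene example above, the following Python snippet enumerates all $2^N$ Boolean configurations and keeps those compatible with Eq. (1). The names `Constraint` and `compatible_configurations`, and the 0-based gene indices, are illustrative choices and not part of the model definition.

```python
from itertools import product
from typing import Callable, List, NamedTuple, Tuple


class Constraint(NamedTuple):
    output: int               # index i(b,0) of the regulated gene (0-based here)
    inputs: Tuple[int, ...]   # indices i(b,1), .., i(b,K_b) of its regulators
    func: Callable[..., int]  # Boolean signal integration function f_b


def compatible_configurations(n_genes: int, constraints: List[Constraint]):
    """Brute-force list of configurations with x_out == f_b(x_in) for every b."""
    solutions = []
    for x in product((0, 1), repeat=n_genes):
        if all(x[c.output] == c.func(*(x[i] for i in c.inputs))
               for c in constraints):
            solutions.append(x)
    return solutions


# Genes 0 and 1 (x_1, x_2 in the text) regulate gene 2 (x_3) through an AND.
and_gate = Constraint(output=2, inputs=(0, 1), func=lambda a, b: a & b)
solutions = compatible_configurations(3, [and_gate])
print(solutions)
```

With N = 3 and M = 1 the two free genes can be set independently, so the enumeration returns $2^{N-M} = 4$ compatible configurations. Brute force is only practical for very small N.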

Structure of the solution space. The phase-space structure 
has never been explored for this particular case. We set out to 
analyze the number of compatible configurations $\mathcal{N}$, for large
N and M and random instances of the problem. To this end, 
consider the following argument, which focuses on the control 
exerted by the N — M "free" input genes on the compatible 
solutions. Together with the free genes, the network contains 
some genes which are regulated but do not regulate any other.
We can refer to them as "leaves" (Fig. 2). Given a realization,
a leaf can always be adjusted to the output value of its
function, which is then satisfied. Let us now remove from the 
graph each leaf and iterate this procedure (a variation of the 
so-called "leaf removal" algorithm [12]). There are two possible
outcomes: either erasing all the graph, leaving the free
genes as isolated points, or stopping at a core of constraints 
that contains loops. In other words, the algorithm identifies 
the tree-like components of the graph connected to outputs. 
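
The pruning procedure described above can be sketched as follows. This is an illustrative Python version, not the authors' implementation: it assumes that each gene is the output of at most one constraint and represents a constraint as a pair `(output, inputs)`. The function names `leaf_removal` and `core_size` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A constraint is (output gene, tuple of input genes); functions are irrelevant here.
Constraint = Tuple[int, Tuple[int, ...]]


def leaf_removal(constraints: List[Constraint]) -> List[Constraint]:
    """Iteratively delete constraints whose output gene regulates nothing else.

    Returns the constraints that survive (the core); the list is empty when
    the whole graph can be erased."""
    remaining: Dict[int, Constraint] = dict(enumerate(constraints))
    while True:
        # Genes still used as an input by some remaining constraint.
        regulators: Set[int] = {g for _, ins in remaining.values() for g in ins}
        leaves = [b for b, (out, _) in remaining.items() if out not in regulators]
        if not leaves:
            return list(remaining.values())
        for b in leaves:
            del remaining[b]


def core_size(core: List[Constraint]) -> Tuple[int, int]:
    """Return (N_c, M_c): genes and constraints appearing in the core."""
    genes = {g for out, ins in core for g in (out, *ins)}
    return len(genes), len(core)


chain = [(1, (0,)), (2, (1,))]       # 0 -> 1 -> 2, tree-like
loop = [(0, (1, 2)), (1, (0, 2))]    # genes 0 and 1 regulate each other, 2 is free
print(leaf_removal(chain), core_size(leaf_removal(loop)))
```

In the chain every constraint is eventually removed, while the mutual-regulation loop survives as a core with $N_c = 3$ genes and $M_c = 2$ constraints.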



The core will be composed of $N_c$ genes and $M_c$ constraints.
Let us now imagine having a compatible configuration, flipping
a free gene, and trying to construct another compatible
configuration. In the case where the core is empty, since the graph
is tree-like, it will always be possible to perform this
operation by local rearrangements, which propagate the output of
the functions. Thus, flipping all the free genes, we find $2^{N-M}$
satisfying configurations. In the presence of a core, because
of the loops, flipping a free gene of the core will in general
rearrange all the core genes, and is not guaranteed to lead to
a new satisfying state. In fact, it is not guaranteed that there
will be a compatible state to start with. Provided there is, the
output propagation procedure can be applied to the non-core free
genes to construct another. Thus, in general the $N - M$ degrees
of freedom given by the free genes cannot guarantee a
solution. The relevant parameter is the number of core free
genes $\Delta_c = N_c - M_c$. Let us for the moment restrict to the
case of fixed in-degree ("K-GR1"). In appendix B, we show
that the average of $\mathcal{N}$ on the class of all Boolean functions is
$2^{N-M}$. Thus: (a) if the core is empty, the compatible
configurations constructed by flipping the free genes are
on average all the possible ones; (b) if the core is not empty,
in the average case it will still be possible to construct $2^{N-M}$
solutions by flipping the free genes. If $\Delta_c > 0$, there will be
$2^{\Delta_c}$ clusters of solutions; and (c) in the case where the core
contains no free genes there will generally be contradictions.
We can thus distinguish the three regimes: (a) simple control,
(b) complex control, (c) no control. Considering ensembles
of random graphs, the regimes above depend on the value of
$N_c(N)$, $M_c(M)$, so that a proper order parameter to adopt is
$\gamma = M/N$. The phase diagram can be explored by studying
the rank and the kernel of the connectivity matrix in the
ensemble.

FIG. 2: Example of a leaf. The free gene that regulates it will not be
in the core. The other two transcription factors are connected to the
rest of the network, represented by a cartoon.
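
To illustrate why a core without free genes can produce contradictions, here is a small hedged example in Python. It counts by brute force the compatible configurations of a two-gene loop with no free genes, for two different choices of the Boolean functions. The helper `count_solutions` and the two toy instances are assumptions made for illustration only.

```python
from itertools import product


def count_solutions(n_genes, constraints):
    """Count configurations satisfying x_out == f(x_in) for all constraints.

    Each constraint is a triple (output, inputs, func)."""
    return sum(
        all(x[out] == f(*(x[i] for i in ins)) for out, ins, f in constraints)
        for x in product((0, 1), repeat=n_genes)
    )


def identity(a):
    return a


def negation(a):
    return 1 - a


# Two genes regulating each other: N = M = 2, no free genes in the core.
consistent_loop = [(0, (1,), identity), (1, (0,), identity)]
frustrated_loop = [(0, (1,), negation), (1, (0,), identity)]

print(count_solutions(2, consistent_loop), count_solutions(2, frustrated_loop))
```

The first loop admits two compatible configurations and the second none, so individual realizations can deviate from the average value $2^{N-M} = 1$ in either direction.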

Example: the case of Poisson-distributed $c_i$. This is the
simplest ensemble to consider, where
$p(c) = \frac{(k\gamma)^c}{c!} e^{-k\gamma}$ [12].
This probability distribution doesn't exclude that genes may
appear in the functions that regulate them, leaving some
freedom of choice. In the simplest case, one finds
$N_c(\gamma) = N(m - k\gamma(1 - m)m^{k-1})$ and
$M_c(\gamma) = N\gamma m^k$, where
$m$ is defined by the relation $e^{-k\gamma m^{k-1}} + m - 1 = 0$.

This gives the phase diagram as a function of $\gamma$. For
$\gamma < \gamma_d$ there is simple control, for
$\gamma_d < \gamma < \gamma_c$ complex control, and for
$\gamma > \gamma_c$ no control. For example, for 4-GR1,
$\gamma_d \simeq 0.776$ and $\gamma_c \simeq 0.977$. The regimes of
gene control correspond to thermodynamic phases, commonly referred
to as SAT, HARD-SAT, and UNSAT phase respectively [11].
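
The sign of $\Delta_c = N_c - M_c$ as a function of $\gamma$ can be explored numerically. The sketch below uses the expressions $N_c/N = m - k\gamma(1-m)m^{k-1}$ and $M_c/N = \gamma m^k$, and assumes that $m$ is the largest root of $e^{-k\gamma m^{k-1}} + m - 1 = 0$, the usual leaf-removal self-consistency condition for Poisson random graphs. That condition, the fixed-point iteration and the function names are assumptions of this sketch.

```python
import math


def core_fraction(gamma: float, k: int, iterations: int = 10_000) -> float:
    """Largest solution of m = 1 - exp(-k*gamma*m**(k-1)).

    Plain fixed-point iteration started from m = 1; convergence becomes slow
    close to the threshold where a nonzero solution first appears."""
    m = 1.0
    for _ in range(iterations):
        m = -math.expm1(-k * gamma * m ** (k - 1))
    return m


def delta_c_per_gene(gamma: float, k: int) -> float:
    """(N_c - M_c) / N for connectivity k and constraint density gamma."""
    m = core_fraction(gamma, k)
    n_c = m - k * gamma * (1.0 - m) * m ** (k - 1)  # N_c / N
    m_c = gamma * m ** k                            # M_c / N
    return n_c - m_c


for gamma in (0.5, 0.9, 0.99):
    print(gamma, core_fraction(gamma, 4), delta_c_per_gene(gamma, 4))
```

For $k = 4$ the core is empty at $\gamma = 0.5$, $\Delta_c$ is positive at $\gamma = 0.9$ and negative at $\gamma = 0.99$, which is the qualitative sequence of regimes described in the text. Scanning $\gamma$ for the appearance of a nonzero $m$ and for the sign change of $\Delta_c$ gives numerical estimates of the two thresholds under the stated assumption.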
Furthermore, it is possible to show rigorously the clustering of
solutions argued above. In the simple control phase, one cluster
contains all the solutions, and a free gene controls O(1)
other genes. The reason for this is that, for Poisson distributed
out-degree, the average number of controlled genes is finite
($c\gamma$), while the number of free genes is extensive. Conversely,
in the complex control phase, the free genes within the core
control O(N) other genes (while there is still O(1) control
outside of the core). From a physical point of view, the
clusters are separated by an extensive distance, i.e. by free
energy barriers. The number of clusters is related to the
(computational) complexity $\Sigma$ of the system, defined by the
relation $\log \mathcal{N} \simeq N(\Sigma + S)$. Here $S$, the entropy,
measures the width of each cluster, while $\Sigma$ "counts" the number
of clusters. Therefore, by definition, $\Sigma$ is directly related to
$\Delta_c$, i.e. to the partitioning of the free genes in and out of
the core. How the system explores (or not) these clusters depends on
details of its dynamics. Generically, one can say that the
dynamics in a cluster will be residual: many variables are fixed,
the others can change. This matches a qualitative feature of
many cells, where some genes are constantly expressed, and
the rest vary.

FIG. 3: $\Delta_c$ as a function of $\gamma$ for different values of $\nu$
in the multi-Poisson case. The discrete jumps are due to the onset of
complex control phases for the different values of $k$. $\Delta_c$ can
become negative many times, giving rise to reentrant no control phases
(inset). The figure refers to a connectivity distribution with a cutoff
at $k = 12$.

FIG. 4: Phase diagram $\gamma - \nu$ for the multi-Poisson case. The dashed
line, a power law with exponent ~ 1.558, represents the mean
value of the numerically evaluated critical parameter $\gamma_d(\nu)$ for
the simple-complex control transition of network realizations with N =
3 X 10''.

Multi-Poisson phase diagram. While the fixed $k$ case is
useful to get some theoretical insight, the known transcription
networks are far from having fixed in-degree. For example, in
E. coli, the in-degree has Poisson distribution, while the
out-degree resembles a power law. For this reason, a biologically
more interesting case is when both the in- and the out-degree
vary along the network. Considering
$p(k|c) = \frac{(k\gamma)^c}{c!} e^{-k\gamma}$,
the conditioned probability that a variable is in $c$ clauses
of the $k$ kind, we have
$p(c) = \sum_k \frac{(k\gamma)^c}{c!} e^{-k\gamma} p(k)$. The
leaf removal equations can be applied separately to sets of
clauses with a given connectivity, defining
$N_c = \langle N_c \rangle_k$ and $M_c = \langle M_c \rangle_k$,
where $\langle X \rangle_k = \sum_k p(k) X(k)$.
Choosing $p(k) = Z^{-1}(\nu) e^{-\nu k}$ does not affect the
exponential asymptotic decay of $p(c)$ for large $c$. We can call this
case multi-Poisson, as the graph is a superposition, following
a Poisson distribution, of graphs with fixed in-degree and
Poisson-distributed out-degree. The behavior of GR1 on such
a topology is nontrivially different from the fixed connectivity
case. The main reason for this is that, while $\Delta_c(\gamma)$ is still
locally decreasing, many new discontinuities emerge, due to the
influence of clauses with different connectivities. This gives
rise to different phenomena. Firstly, $\Delta_c$ can increase globally
with increasing $\gamma$. Indeed, it does increase step-wise with $\gamma$
after $\gamma_d$, to decrease again before $\gamma_c$. Its discrete
jumps are due to the onset of complex control phases for the different
values of $k$ (Fig. 3). This fact has an influence on the number
of compatible states as a function of $\gamma$. Secondly, $\Delta_c$
can become negative and then jump back to a positive value,
creating a reentrant UNSAT phase (Fig. 4). This means that,
on average, by increasing the number of constraints one can
pass from unsolvable problems to solvable ones. A heuristic
explanation for this counterintuitive fact is that, at fixed N,
the addition of a constraint might connect a closed loop in
the core to external free genes, thereby solving a
contradiction. Interestingly, the simple to complex control boundary
is a power-law in $\gamma$ (Fig. 4).

Discussion. To conclude, we established a simple frame- 
work for the modeling of large scale transcription networks. 
It is a compatibility analysis on the constraints established by 
transcription. Its advantage is that, while avoiding dealing
directly with the dynamics, it gives non-trivial results. In
the absence of an explicit knowledge of the emergent time 
scales, we feel this is an appropriate approach, particularly 
in the Boolean approximation treated here. From a technical 
standpoint, GR1 is different from other problems of the
satisfiability kind because of the particular structure of its
constraints. This makes it possible to apply the leaf removal
technique, which is ineffective for other models, such as
random k-Sat [11]. From a general standpoint, our model shows that
the "biological" complexity is not simply measured by the
number of genes. A more proper indicator is $\Delta_c$, which
depends on the order parameter $\nu$ or, roughly, on the number
of transcription factors per gene. At fixed number of genes,
it is known that this quantity increases in bacteria that need
to react to more environments [17]. Imagining that prokaryotes
are naturally found in a simple control phase, our phase
diagram predicts an intrinsic limit to this process, represented
by the phase boundary with the complex control phase. The
multi-Poisson case gives an interesting prediction for the
behaviour of this boundary at fixed N. Namely, at criticality,
the average number of constraints scales as a power-law with
$\gamma$. This feature, together with the existence of a core and the
predictions on the control exerted by free genes, can possibly 
be tested experimentally. 

The approach presented here is new, and largely unex- 
plored. It is naturally fit to study networks with non-Boolean 
variables and probabilistic constraints. It can be of use for 
models that describe other regulation mechanisms than just 
transcription. More far-reaching extensions include evolutionary
models. It is not clear yet exactly how useful it can be for
the study of concrete biological networks. Together with the
general trends, a biologically significant model has to be able
to deal with the details of an individual realization of the system.
This is, we think, the main challenge to our approach, and the 
direction we are currently exploring. 

We would like to acknowledge interesting discussions with 
J. Berg, M. Caselle, L. Finzi, M. Leone, A. Sportiello, 
P.R. Tenwolde, R. Zecchina. We thank an anonymous referee 
for help improving our manuscript. 

APPENDIX 

A. GR1 as a Satisfiability problem. In this appendix we
show how a realization of GR1 can be formulated as a
satisfiability problem (Sat) [10], an optimization problem where N
Boolean variables are constrained by $O$ conjunctive normal
form (CNF) constraints (i.e. by a Boolean polynomial
constructed as a product ($\wedge$) of $O$ disjunctive monomials ($\vee$)).
Equation (1) is equivalent to the XOR Boolean constraint
$I_b = \neg(x_{i(b,0)} \veebar f_b)$. This can be recast in CNF, as
$2^{K_b}$ clauses involving $k_b$ variables. Each clause corresponds
with a simple map to each line
$x_{i(b,1)}, .., x_{i(b,K_b)}, x_{i(b,0)}$ (including the
output) in the truth table of $f_b$. Namely, if the value of an
input variable $x_{i(b,j)}$ is 1 in the truth table line, it will be
negated in the CNF clause. Vice versa, it will be affirmed if its
value is 0. The output variable is treated in the opposite way
(affirmed if its value is 1, negated if it is 0), so that the clause
excludes the assignment in which the output disagrees with $f_b$.
The formula $I = \bigwedge_{b=1}^{M} I_b$ defines a Satisfiability
problem on the variables $x_1, .., x_N$, with
$O = \sum_{b=1}^{M} 2^{K_b}$. A
realization of GR1 differs from a Sat problem for the
intrinsically asymmetric form of the constraints, which "force" the
value of $x_{i(b,0)}$. Moreover, considering random instances of
the problem, the space of allowed functions of GR1 is much
smaller than the corresponding Sat problem. For example, for
a clause with fixed connectivity $k$, while there are $2^{2^k}$ possible
Boolean functions, all of which can appear in Sat, only $2^{2^{k-1}}$
of these can appear in GR1. The two above features make the
leaf removal technique useful for the latter model.
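
The translation of one GR1 constraint into CNF can be sketched in a few lines of Python. In this illustrative version each clause is a list of signed, 1-based literals (a DIMACS-like convention chosen here, not taken from the text), one clause per line of the truth table of $f_b$. Input literals are negated when the input is 1 in that line, and the output literal is chosen so that the clause forbids the assignment in which the output disagrees with $f_b$. The function names are hypothetical.

```python
from itertools import product


def constraint_to_cnf(output, inputs, func):
    """Return 2**K clauses equivalent to x[output] == func(*x[inputs]).

    Genes are 0-based; literal +(g+1) means x_g, literal -(g+1) means NOT x_g."""
    clauses = []
    for values in product((0, 1), repeat=len(inputs)):
        # Input literals are all false exactly on this truth table line.
        clause = [-(g + 1) if v else (g + 1) for g, v in zip(inputs, values)]
        # The output literal is true only when the output equals f on this line.
        clause.append((output + 1) if func(*values) else -(output + 1))
        clauses.append(clause)
    return clauses


def satisfies(clauses, x):
    """True if the 0/1 assignment x satisfies every clause."""
    return all(
        any((x[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        for clause in clauses
    )


and_clauses = constraint_to_cnf(2, (0, 1), lambda a, b: a & b)
print(and_clauses)
```

For the AND example of the main text this gives $2^{K_b} = 4$ clauses on $k_b = 3$ variables, and an assignment satisfies all of them exactly when $x_3 = x_1 \wedge x_2$.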

B. Average Number of Solutions. In this appendix, we
discuss the average of $\mathcal{N}$ on the realizations of the constraints
$\{I_b\}$, for K-GR1. One can write

$$\mathcal{N} = \sum_{\vec{x}} \prod_{b=1}^{M} \delta(x_{i(b,0)}; f_b(x_{i(b,1)}, .., x_{i(b,K)})).$$

Here, the randomness is contained: (i) in the specification
of the network structure, $I = (I_1, .., I_M)$, i.e. in the
coordinates $i(b,l)$; (ii) in the specification of the functions
$f = (f_1, .., f_M)$ within the class $\mathcal{F}$ of all Boolean
functions. An overbar indicates an average on both distributions,
$p(I)$ and $p(f)$. Taking for $\mathcal{F}$ the class of all Boolean
functions, it is straightforward to obtain
$\overline{\mathcal{N}} = 2^{N-M}$, independently
from the specification of the network structure. This result
remains true considering a sub-family of functions that
satisfy $[\sum_{f \in \mathcal{F}} p(f) f](x) = p$. The reason for this
is that one finds

$$\overline{\mathcal{N}} = \sum_{\vec{x}, I} p(I) \prod_{b=1}^{M} \left( p\,\delta_{1; x_{i(b,0)}} + (1 - p)\,\delta_{0; x_{i(b,0)}} \right).$$
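
The statement that the average number of solutions equals $2^{N-M}$ whatever the wiring can be checked exactly on very small graphs. The sketch below enumerates every assignment of truth tables to the constraints of a fixed wiring, with uniform weight over all Boolean functions, and averages the number of compatible configurations. The function name and the two toy wirings are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def average_number_of_solutions(n_genes, wiring):
    """Exact uniform average over all Boolean functions for a fixed wiring.

    wiring is a list of (output, inputs); a function with K inputs is a
    truth table, i.e. a tuple of 2**K output bits."""
    table_choices = [product((0, 1), repeat=2 ** len(ins)) for _, ins in wiring]
    total = 0
    n_realizations = 0
    for tables in product(*table_choices):
        n_realizations += 1
        for x in product((0, 1), repeat=n_genes):
            satisfied = True
            for (out, ins), table in zip(wiring, tables):
                # Row of the truth table selected by the current input values.
                row = sum(x[g] << pos for pos, g in enumerate(ins))
                if x[out] != table[row]:
                    satisfied = False
                    break
            total += satisfied
    return Fraction(total, n_realizations)


chain = [(1, (0,)), (2, (1,))]       # tree-like, N = 3, M = 2
loop = [(0, (1, 2)), (1, (0, 2))]    # contains a loop, N = 3, M = 2
print(average_number_of_solutions(3, chain), average_number_of_solutions(3, loop))
```

Both wirings give an average of $2^{N-M} = 2$, with or without a loop, because for any fixed configuration each constraint is satisfied by exactly half of its possible truth tables. The enumeration grows doubly exponentially with the in-degree, so it is only meant as a sanity check.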



* e-mail address: lmcl@c urie.fr l 

^ e-mail address: bassetti@mi.infn.it 
[1] M. Babu et al., Curr Opin Struct Biol 14, 283 (2004).
[2] M. Herrgard et al., Curr Opin Biotechnol 15, 70 (2004).
[3] M. Ptashne, A Genetic Switch (Cell Press, MA, 1992).
[4] N. Buchler et al., Proc Natl Acad Sci USA 100, 5136 (2003).
[5] S. Shen-Orr et al., Nat Genet 31, 64 (2002).
[6] H. McAdams and A. Arkin, Proc Natl Acad Sci USA 94, 814
(1997).
[7] D. Gillespie, J. Phys. Chem. 81, 2340-61 (1977).
[8] S. Kauffman, The Origins of Order (Oxford Univ. Press, New
York, 1993).
[9] C. Gershenson, in Artificial Life IX Workshops and Tutorials
(2004).
[10] S. Mertens, Comput Sci and Eng 4, 31 (2002).
[11] M. Mezard et al., Science 297, 812 (2002).
[12] M. Mezard et al., J Stat Phys 505 (2003). See also S. Cocco
et al., Phys Rev Lett 90, 047205 (2003) and L. Correale et al.,
cond-mat/0412443.
[13] M. Shea and G. Ackers, J Mol Biol 181, 211 (1985).
[14] M. Cosentino Lagomarsino et al. (2005), q-bio.MN/0502017.
[15] We should note that the conventional average of $\mathcal{N}$ might be
biased by the weight of exceptions [16]. To access the typical
behavior of the system, the correct quantity to compute is the
"quenched average", $\overline{\log \mathcal{N}}$, usually accessed with
the replica, or similar methods [11]. In the case under examination
we have computed the annealed average $\log \overline{\mathcal{N}}$ (in
general $\overline{\log \mathcal{N}} \le \log \overline{\mathcal{N}}$).
For K-GR1 with Poisson distributed out-degree, we have shown
the presence of a self-averaging property for $\mathcal{N}$. That is, the
quantity $(\overline{\mathcal{N}^2} - \overline{\mathcal{N}}^2)/\overline{\mathcal{N}}^2$
vanishes in the thermodynamic limit $N \to \infty$ so that, in the
simple and complex control regimes, an equality holds between
quenched and annealed average [14].
[16] M. Mezard et al., Spin Glass Theory and Beyond (World
Scientific, Singapore, 1987).
[17] I. Cases et al., Trends Microbiol 11, 248 (2003).
[18] E. van Nimwegen, Trends Genet 19, 479 (2003).



